Bucket map join is a special type of map join applied on the bucket
tables. To enable bucket map join, we need to enable the following
settings.
![left_semi_join](https://letsdobigdata.files.wordpress.com/2016/03/left_semi_join1.png?w=620)
+--------------+
¦ Id ¦ Data ¦
+----+---------¦
¦ 1 ¦ DataA11 ¦
¦ 1 ¦ DataA12 ¦
¦ 1 ¦ DataA13 ¦
¦ 2 ¦ DataA21 ¦
+--------------+
CREATE TABLE mytable (
name string,
city string,
employee_id int )
PARTITIONED BY (year STRING, month STRING, day STRING)
CLUSTERED BY (employee_id) INTO 256 BUCKETS
then hive will store data in a directory hierarchy like
/user/hive/warehouse/mytable/y=2015/m=12/d=02
So, you have to be careful when partitioning, because if you for instance partition by employee_id and you have millions of employees, you'll end up having millions of directories in your file system. The term 'cardinality' refers to the number of possible value a field can have. For instance, if you have a 'country' field, the countries in the world are about 300, so cardinality would be ~300. For a field like 'timestamp_ms', which changes every millisecond, cardinality can be billions. In general, when choosing a field for partitioning, it should not have a high cardinality, because you'll end up with way too many directories in your file system.
- SET hive.auto.convert.join=true; --default false
- SET hive.optimize.bucketmapjoin=true; --default false
![left_semi_join](https://letsdobigdata.files.wordpress.com/2016/03/left_semi_join1.png?w=620)
+--------------+
¦ Id ¦ Data ¦
+----+---------¦
¦ 1 ¦ DataA11 ¦
¦ 1 ¦ DataA12 ¦
¦ 1 ¦ DataA13 ¦
¦ 2 ¦ DataA21 ¦
+--------------+
CREATE TABLE mytable (
name string,
city string,
employee_id int )
PARTITIONED BY (year STRING, month STRING, day STRING)
CLUSTERED BY (employee_id) INTO 256 BUCKETS
then hive will store data in a directory hierarchy like
/user/hive/warehouse/mytable/y=2015/m=12/d=02
So, you have to be careful when partitioning, because if you for instance partition by employee_id and you have millions of employees, you'll end up having millions of directories in your file system. The term 'cardinality' refers to the number of possible value a field can have. For instance, if you have a 'country' field, the countries in the world are about 300, so cardinality would be ~300. For a field like 'timestamp_ms', which changes every millisecond, cardinality can be billions. In general, when choosing a field for partitioning, it should not have a high cardinality, because you'll end up with way too many directories in your file system.
No comments:
Post a Comment